Motivating case study
Linear models
Regularized linear models
Describe the theoretical foundation of intrinsically interpretable models like sparse regression, Gaussian processes, and classification and regression trees, and apply them to realistic case studies with appropriate validation checks.
Compare the competing definitions of interpretable machine learning, the motivations behind them, and metrics that can be used to quantify whether they have been met.
Patients with the same diagnosed cancer often respond very differently to the same drug. How can we figure out which drugs any particular patient will respond to?
If drug effectiveness = f(gene activity), then one approach is to measure gene activity in the patient’s cancer tissue samples.
Features in that model can be used to stratify patients into responder/non-responder subtypes.
The study (Dietrich et al. 2017) measured drug responses in primary samples from patients being treated for chronic lymphocytic leukemia (CLL). The authors simultaneously measured gene expression and DNA methylation, then tested whether the cells were killed by antitumor drugs.
Features
Outcome
Which genomic features differentiate between drug sensitivity vs. resistance?
Viability can be viewed as a response variable \(\mathbf{y} \in \mathbf{R}^{N}\), and the molecular variables can be treated as features \(\mathbf{X} \in \mathbf{R}^{N \times J}\). Here \(N\) is the number of samples and \(J\) is the number of molecular features.
The setting is high-dimensional with fewer samples (\(N = 121\)) than features (\(J = 9553\)). Without regularization, the problem is underdetermined.
Sparsity will help us focus on the most important pathways out of thousands of candidates.
\[\begin{align*} y_i=\beta_0+x_{i} \beta_1+\epsilon_i \end{align*}\]
The least-squares estimate \(\hat{\beta} := \left(\hat{\beta}_{0}, \hat{\beta}_{1}\right)\) is found by minimizing
\[\begin{align*} \min_{\beta_0, \beta_1} \sum_{i = 1}^{N}\left(y_i-\beta_0-x_{i} \beta_1\right)^2 \end{align*}\]
In the reading, model house price \(y_{i}\) as a function of house area \(x_{i} \in \mathbf{R}\).
In the case study, model viability \(y_{i}\) for sample \(i\) as a linear function of a single gene’s expression level \(x_{i} \in \mathbf{R}\):
Each choice of \(\beta_{0}, \beta_{1}\) is associated with a different straight line and a different loss value.
For a given dataset, the loss is a quadratic function of \(\left(\beta_{0}, \beta_{1}\right)\). Its unique minimizer is the least squares solution.
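The least-squares minimizer for a single feature has a closed form, \(\hat{\beta}_1 = \widehat{\operatorname{cov}}(x, y)/\widehat{\operatorname{var}}(x)\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\). A minimal sketch (the toy `x`, `y` values are made up for illustration):

```python
# Closed-form least squares for a single feature:
# beta1 = cov(x, y) / var(x), beta0 = ybar - beta1 * xbar.

def fit_simple_ls(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    beta1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
            sum((xi - xbar) ** 2 for xi in x)
    beta0 = ybar - beta1 * xbar
    return beta0, beta1

# Data generated exactly from y = 1 + 2x, so the fit recovers those values.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
beta0, beta1 = fit_simple_ls(x, y)
print(beta0, beta1)  # 1.0 2.0
```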
If a variable includes \(K\) categories, it can be one-hot encoded into \(K - 1\) binary columns,
\[\begin{align*} x_{ik} = \mathbf{1}\{\text{sample } i \text{ belongs to level } k\} \end{align*}\]
The left-out category is the reference level.
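A minimal sketch of this encoding, assuming the first listed level is the reference (the category names are illustrative):

```python
# One-hot encoding with a reference level: K categories become K-1
# binary columns, and the reference level is encoded as all zeros.

def one_hot(values, levels):
    """Encode each value as K-1 indicators, dropping levels[0] (the reference)."""
    return [[1 if v == k else 0 for k in levels[1:]] for v in values]

levels = ["Somerset", "Gilbert", "NAmes"]   # reference level listed first
samples = ["Gilbert", "Somerset", "NAmes"]
print(one_hot(samples, levels))
# [[1, 0], [0, 0], [0, 1]]
```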
In the reading, \(x_{i} \in \{\text{Gilbert}, \text{North Ames}, \text{Edwards}, ...\}\) records the neighborhood for house \(i\).
\(\beta_0\): The typical price in the reference “Somerset” neighborhood.
\(\beta_1\): The amount the predicted price changes when moving from “Somerset” to “Gilbert”,
\(\beta_{2}\): The amount the predicted price changes when moving from “Somerset” to “North Ames”,
and similarly for the remaining neighborhoods.
Assumed model form:
\[\begin{align*} y_{i} &= \sum_{j = 1}^{J}x_{ij}\beta_{j} + \epsilon_{i} \\ &:= \mathbf{x}_{i}^\top \beta + \epsilon_{i} \end{align*}\]
\[\begin{align*} \min_{\beta \in \mathbf{R}^{J}} \sum_{i=1}^N \left( y_i - \mathbf{x}_i^\top \beta \right)^2 \end{align*}\]
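This objective can be minimized with a standard least-squares solver, which is equivalent to solving the normal equations \(\mathbf{X}^\top \mathbf{X} \beta = \mathbf{X}^\top \mathbf{y}\). A sketch on synthetic data (the design matrix and true coefficients below are made up):

```python
# Multiple regression via numpy's least-squares solver.
import numpy as np

rng = np.random.default_rng(0)
N, J = 50, 3
X = rng.normal(size=(N, J))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true  # noiseless, so least squares recovers beta_true exactly

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # approximately [2.0, -1.0, 0.5]
```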
We can imagine how \(y\) changes when changing two features simultaneously.
“All other things being equal”.
\(\beta_j\) gives the impact of changing \(x_j\) while every other feature in the model is held fixed.
In the housing price example,
\[\begin{align*} \text{predicted price} = &-871,630 + 88 \times \text{area} + 19,129 \times \text{quality} + \\ &426 \times \text{year} - 12,667 \times \text{bedroom} \end{align*}\]
For every additional square foot, the price increases by $88, all else held equal.
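The fitted equation can be written as a function to make this interpretation concrete. The coefficients are from the formula above; the feature values in the example call are hypothetical:

```python
# The fitted housing-price equation, with hypothetical inputs.
def predicted_price(area, quality, year, bedroom):
    return -871_630 + 88 * area + 19_129 * quality + 426 * year - 12_667 * bedroom

base = predicted_price(area=1500, quality=7, year=2000, bedroom=3)
plus_one_sqft = predicted_price(area=1501, quality=7, year=2000, bedroom=3)
print(plus_one_sqft - base)  # 88: one extra square foot adds $88, all else equal
```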
In the housing price example,
\[\begin{align*} \text{predicted price} = &-871,630 + 88 \times \text{area} + 19,129 \times \text{quality} + \\ & 426 \times \text{year} - 12,667 \times \text{bedroom} \end{align*}\]
When all the features are 0, the predicted price is negative, which makes no sense. But the data contain no zero-square-foot homes, so the model never had to learn sensible behavior there; the intercept is an extrapolation.
The coefficient values must be interpreted within the context of all other predictors.
\[\begin{align*} \text{predicted price} = &-871,630 + 88 \times \text{area} + 19,129 \times \text{quality}+ \\ &426 \times \text{year} - 12,667 \times \text{bedroom} \end{align*}\]
\[\begin{align*} \text{predicted price} = &-750,097 + 37,765 \times \text{quality} + 335 \times \text{year} + \\ & 13,935 \times \text{bedroom} \end{align*}\]
This instability in coefficient interpretations is most severe when predictors are correlated with one another.
In the case study, genes are correlated when they lie on the same pathway. Holding other genes “fixed” is not realistic. Estimates will change if any genes are dropped.
Large coefficient \(\neq\) an important predictor.
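This instability can be illustrated with a small synthetic example, assuming a feature `x2` that is a near-copy of `x1` while the response depends only on `x1`:

```python
# Coefficient instability under correlation: compare the fit with and
# without the correlated feature x2 in the model.
import numpy as np

rng = np.random.default_rng(1)
N = 200
x1 = rng.normal(size=N)
x2 = x1 + 0.05 * rng.normal(size=N)   # strongly correlated with x1
y = 3.0 * x1 + 0.1 * rng.normal(size=N)

both, *_ = np.linalg.lstsq(np.column_stack([x1, x2]), y, rcond=None)
alone, *_ = np.linalg.lstsq(x1.reshape(-1, 1), y, rcond=None)
# With x2 present, the effect of x1 is split between the two columns;
# only their sum is stable.
print(both, alone)
```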
Respond to [Linear Model Interpretability] in the exercise sheet.
Regularizing a predictive model means forcing it towards a simpler solution. This is usually achieved by adding penalty terms to the optimization objective used to estimate the model parameters.
When features are correlated, the loss surface has long “valleys” where many different solutions look nearly equally good.
This can lead to instability in the resulting fits.
One way to address this is to add an \(\ell^{2}\)-penalty to the least-squares objective.
\[\begin{align*} \min_{\beta \in \mathbf{R}^J} \left[ \frac{1}{2N} \sum_{i=1}^N \left(y_i - \mathbf{x}_i^\top \beta\right)^2 + \lambda \lVert \beta \rVert_{2}^{2} \right]. \end{align*}\] This is the same loss as linear regression, but with a new \(\ell^{2}\) penalty \(\lVert\beta\rVert_{2}^{2} = \sum_{j} \beta_{j}^{2}\), the squared \(\ell^{2}\) norm. \(\lambda \geq 0\) is a tuning parameter controlling model complexity.
This method is called ridge regression. Geometrically, this penalty encourages \(\beta\) to be closer to the origin.
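Ridge regression has a closed form: setting the gradient of the objective \(\frac{1}{2N}\sum_i (y_i - \mathbf{x}_i^\top\beta)^2 + \lambda\lVert\beta\rVert_2^2\) to zero gives \(\hat{\beta} = (\mathbf{X}^\top\mathbf{X} + 2N\lambda \mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}\). A sketch on synthetic data showing the shrinkage toward the origin:

```python
# Ridge regression via its closed form; larger lambda shrinks the
# coefficient vector toward the origin.
import numpy as np

def ridge(X, y, lam):
    N, J = X.shape
    return np.linalg.solve(X.T @ X + 2 * N * lam * np.eye(J), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=40)

norms = [np.linalg.norm(ridge(X, y, lam)) for lam in (0.0, 0.1, 1.0)]
print(norms)  # strictly decreasing in lambda
```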
If we have many noise features (unrelated to the response), least squares will still assign them nonzero coefficients. This causes overfitting: our predictions depend on irrelevant features.
In the case study, we don’t expect all genes to matter. It’s more likely that there are a few key genes.
Feature selection: Lasso sets many coefficients \(\beta_{j}\) to exactly zero. The \(\ell^{1}\) penalty “induces sparsity.”
The Lasso regression objective is \[\begin{align*} \min_{\beta \in \mathbf{R}^J} \left[ \frac{1}{2N} \sum_{i=1}^N \left(y_i - \mathbf{x}_i^\top \beta\right)^2 + \lambda \lVert \beta \rVert_1 \right]. \end{align*}\] This is like ridge regression but with a new \(\ell^{1}\) penalty
\[\begin{align*} \|\beta\|_{1} := \sum_{j = 1}^{J} \left|\beta_{j}\right| \end{align*}\]
It’s not obvious, but the minimizers often have coordinates \(\beta_{j} = 0\). The “selected” features are those where \(\beta_{j} \neq 0\).
\[\begin{align*} \min_{\beta \in \mathbf{R}^J} \left[ \frac{1}{2N} \sum_{i=1}^N \left(y_i - \mathbf{x}_i^\top \beta\right)^2 + \lambda \lVert \beta \rVert_1 \right]. \end{align*}\]
The minimizers lie in the “creases” where some \(\beta_{j}\) are exactly zero.
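The zeros can be seen directly by solving the lasso objective with coordinate descent, where each update soft-thresholds one coefficient: this thresholding is what produces coordinates that are exactly zero. A sketch on synthetic data with a sparse true coefficient vector (the data and \(\lambda\) value are made up for illustration):

```python
# Lasso via coordinate descent with soft-thresholding, for the objective
# (1/2N) * sum_i (y_i - x_i^T beta)^2 + lambda * ||beta||_1.
import numpy as np

def soft_threshold(a, t):
    return np.sign(a) * max(abs(a) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    N, J = X.shape
    beta = np.zeros(J)
    for _ in range(n_iter):
        for j in range(J):
            # Partial residual: remove feature j's current contribution.
            r = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r / N, lam) * N / (X[:, j] @ X[:, j])
    return beta

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 8))
beta_true = np.array([3.0, 0, 0, -2.0, 0, 0, 0, 1.5])
y = X @ beta_true + 0.1 * rng.normal(size=100)

beta_hat = lasso_cd(X, y, lam=0.5)
print(beta_hat)  # the noise features' coordinates are exactly 0.0
```

Note that the selected (nonzero) coefficients are shrunk toward zero relative to their least-squares values; this bias is the price paid for the sparsity.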
Respond to the following T/F questions from the reading on linear model extensions. Justify your choices.
The magnitude of the LS coefficient of a predictive feature corresponds to how important the feature is for generating the prediction.
Increasing the number of predictive features in a predictive fit will always improve the predictive performance.
More regularization means that regularized coefficients will be closer to the original un-regularized LS coefficients.